Identifying cell type-defining genes for translation to flow cytometry

Rationale

Inflammatory immune cells, such as microglia and monocytes, play a significant role in disease progression. Accurately identifying these immune cells is essential for understanding their specific contributions to pathology. Distinguishing microglia, the resident immune cells of the central nervous system, from other infiltrating immune cells like monocytes, is particularly critical, as these cells often assume opposing roles during disease. However, this distinction is challenging due to their tendency to adopt similar phenotypes and transcriptional profiles under pathological conditions.

Single-cell RNA sequencing has emerged as a powerful technique that enables the quantification of thousands of genes at the single-cell level, revolutionizing our understanding of immune cell heterogeneity. This approach allows for the precise identification of genes that distinguish specific cell types and immune states. Despite these advancements, single-cell RNA sequencing is costly, requires specialized training, and involves complex multi-omics data analysis. Therefore, there is a need to develop a method that identifies key cell type-defining genes that can be translated into more cost-effective and widely accessible techniques, such as conventional flow cytometry.

The aim of this project is to identify cell type-defining genes from single-cell RNA sequencing data that can be applied to flow cytometry. Highly variable genes within the dataset will be identified to reduce the dimensionality of the data, enabling downstream clustering into distinct cell types. Cell type or subset identity will be predicted using a random forest classifier based on the features essential for classification. The key genes defining each cell type will be identified by their importance, as determined by the random forest analysis, and visualized in a flow cytometry-like plot to simulate a gating strategy commonly used in this technique.

Hypotheses

  1. Dimensionality reduction will capture patterns in the expression of highly variable genes that distinguish between (a) cell types (e.g., monocytes and microglia) and (b) cell subsets.
  2. A small number of genes can accurately distinguish these cell types (e.g., microglia vs monocytes) or cell subsets by their features (expression of variable genes).

Methods

Dataset selection

The dataset used in this report is a single cell RNA-sequencing dataset of immune cells. These immune cells are derived from spinal cords of mice in one of two conditions: (1) experimental autoimmune encephalomyelitis (EAE, an animal model of Multiple Sclerosis) or (2) phosphate-buffered serum (PBS)-injected controls. This dataset was downloaded from the Gene Expression Omnibus, a public data repository hosting gene expression data (dataset GSE#: GSE146113).

Pre-processing

Data cleaning

Data was converted into a S4 Seurat object with the Seurat package. All subsequent analyses were performed on the Seurat object. Low quality cells were filtered out of the dataset prior to analysis using the Seurat ‘subset’ function. Low quality cells were defined as cells with greater than five percent mitochondrial gene expression and less than 250 reads per cell. Doublets were defined as cells with greater than 2500 genes/cell and were excluded from analysis.

Normalisation

Gene expression measurements for each cell was log normalized by the total expression at a scale factor of 1000 using the Seurat function ‘Normalize Data’.

Dimensional Reduction

Defining variable features

Variables genes were identified for subsequent dimensional reduction using the ‘FindVariableGenes’ Seurat function, which selects genes displaying a variance/mean ratio above 0.1. The average expression and dispersion was then calculated for each gene, placed into a bin and given a z-score for dispersion in each bin.

Data scaling and centring

Prior to dimensional reduction, gene expression was centered and scaled using the Seurat function ‘ScaleData’. This function centers the expression of each gene by subtracting the average expression of the gene for each cell. Scaling divides the centered gene expression levels by the standard deviation.

Principal component analysis

Principal components (PCs) were calculated using the ‘RunPCA’ Seurat function. Each PC represents a “metagene” that reduces information across a correlated gene set (identified as the top variable genes). PC selection for downstream clustering was defined by elbow plots. The ‘elbow’ was defined by taking the larger value of the point where the (1) principal components cumulatively contribute to 90% of the standard deviation and (2) percent change in variation between the consecutive PCs is less than 0.1%. Based on these metrics, 11 PCs were used for downstream clustering.

Cell type classification

Clustering

Cells were grouped in distinct cell clusters using the ‘FindClusters’ function in Seurat with the number of PCs defined above. Cells were grouped using K-nearest neigbhour clustering, which draws edges between cells with similar expression patterns.

Benchmarking

Prior to analysis, scaled and centred data was randomized and split into training (80% of data) and testing (20% of data) subsets. Variable importances based on random forests were calculated for each gene by the ‘Ranger’ package, which fits a classification tree using the implementation provided by the partykit R package. The function FindAllMarkers from the ‘scTree’ package was used to define importance p-values for each gene in defining each cell susbet or cell type.

Random forest classifier

A classifier was trained by selecting the top three genes per target (defined by their variable importances). These features were then used to train a prediction model on the training set, and the predictions were then applied to the testing set.

Validation

Model performance was evaluated by the caret package. The random forest model (trained on ‘training’ set) was applied to the ‘testing’ set. Predictions were then compared with the true cell type/cell subset identity. The ConfusionMatrix function was used to calculate a cross-tabulation of observed and predicted cell type/cell subset identities and the relevant statistics.The function MultiClassSummary was used to calculate overall measures of performance (e.g., overall accuracy and the Kappa statistic).

Flow-style plots

Faceted scatter plots were created using the top three genes (selected by highest importance) for classifying each cell type (either ‘Microglia’ or ‘Monocytes’) using the function ‘plot_flowstyle’ by the sctree package. These plots were used to simulate flow cytometry-like plots from the identified important features.

Results

Pre-processing

Data cleaning

The dataset was generated by Mendiola et al.2020 to study the activation profiles of microglia and monocytes in EAE, an animal model of Multiple Sclerosis. The dataset consists of spinal cord tissue from two experimental conditions (EAE and healthy control). The data was acquired by quantifying single-cell gene expression by single-cell RNA-sequencing (10x Genomics) from spinal cord tissue by quantifying gene expression in each cell by single-cell RNA-sequencing (10x Genomics).

The acquired dataset contains measurements for 9079 cells and 16681 genes. To ensure technical noise does not affect downstream clustering of these cells, low quality cells need to be removed. Two common measures of cell quality are the number of expressed features (nFeature_RNA) and the total number of counts across all gene (nCount_RNA). Cells with very few expressed genes or counts are likely to be of low quality as the RNA has not been efficiently captured during library preparation. Another measure of quality is the percentage of reads that are mitochondrial genes (percent.mt). Dying cells express a high percentage of mitochondrial genes likely due to apoptotic processes, and need to be excluded from analysis. The distribution of these three metrics across each sample are shown in Figure 1A-C.

Low quality cells were defined as cells with greater than five percent mitochondrial gene expression and less than 250 reads per cell. Doublets were defined as cells with greater than 2500 genes/cell and were excluded from analysis. After filtering data, 6859 cells remained that met the quality control requirements. Data was additionally log-normalized.

As the total number of reads in a cell correlates strongly with the number of expressed genes (Figure 1D), a threshold for lowly expressed genes was not included in data filtering as it was likely accounted for by the read threshold. After quality control and accounting for technical noise, 6859 cells were considered for analysis.

par(mfrow=c(2,2))
plot1 <- VlnPlot(data, features = c("nCount_RNA", "nFeature_RNA", "percent.mt"),
                   ncol     = 3,
                   pt.size  = 0.25)

plot1 <- VlnPlot(data, features = "nCount_RNA", pt.size = 0.25) + labs(
       title = "A.  Counts per cell") +
   theme(plot.title = element_text(hjust = 0, vjust=2.12, size=12)) 

plot2 <- VlnPlot(data, features = "nFeature_RNA", pt.size = 0.25) + labs(
       title = "B.  Genes expressed per cell") +
   theme(plot.title = element_text(hjust = 0, vjust=2.12, size=12)) 

plot3 <- VlnPlot(data, features = "percent.mt", pt.size = 0.25) + labs(
       title = "C.  Percent mitochondrial gene expression") +
   theme(plot.title = element_text(hjust = 0, vjust=2.12, size=12)) 

plot4 <- FeatureScatter(data, feature1 = "nCount_RNA", feature2 = "nFeature_RNA") + labs(
       title = "D.") +
   theme(plot.title = element_text(hjust = 0, vjust=2.12, size=12)) 

grid.arrange(plot1, plot2, plot3, plot4, layout_matrix = rbind(c(1,2,3),
                                                              c(4,NA,NA)))
\label{fig:figs} Figure 1. Quality control metrics per sample. Shown is the total number counts per cell (nCount_RNA, A), gene expression per cell (nFeature _RNA, B) and percent mitochondrial gene expression (percent.mt, C). (D) Correlation between total number of counts and number of expressed genes per cell. EAE, Experimental Autoimmune Encephalomyelitis.

Figure 1. Quality control metrics per sample. Shown is the total number counts per cell (nCount_RNA, A), gene expression per cell (nFeature _RNA, B) and percent mitochondrial gene expression (percent.mt, C). (D) Correlation between total number of counts and number of expressed genes per cell. EAE, Experimental Autoimmune Encephalomyelitis.

data <- data %>%
      subset(
        nFeature_RNA > 250 &   # Remove cells with < 250 detected genes
          nFeature_RNA < 2500 &  # Remove cells with > 2500 detected genes (could be doublets)
          percent.mt < 5     # Remove cells with > 5% mitochondrial read
      )

ncol(data)
## [1] 6859

Dimensionality Reduction

Defining variable features

Prior to characterizing highly variable genes, the data was log normalized. Highly variable genes are identified to focus on cells that drive heterogeneity across the population of cells. This requires calculating the variance in expression of each gene across the dataset.

data.norm <- NormalizeData(data, normalization.method = "LogNormalize", scale.factor = 10000)
all.genes <- rownames(data.norm)
data.norm <- FindVariableFeatures(data.norm, selection.method = "mean.var.plot")

length(VariableFeatures(data.norm))
## [1] 688

Highly variable genes are defined as genes that display a mean to variance ratio greater than 0.1 and a dispersion greater than one. This identified 688 highly variable genes (Figure 2), which are ranked to focus on genes with larger biological variability.

top10 <- head(VariableFeatures(data.norm), 10)
plot1 <- VariableFeaturePlot(data.norm)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE, xnudge = 0, ynudge = 0)
plot1 + plot2 
\label{fig:figs} Figure 2. Dispersion values for each gene, plotted against log-transformed average expression. Highly variable genes are annotated.

Figure 2. Dispersion values for each gene, plotted against log-transformed average expression. Highly variable genes are annotated.

Principal component analysis

PC selection

Next, dimensionality reduction is performed on the high variable genes to determine if there is any substructure in the dataset. Prior to performing dimensionality reduction, the dataset was scaled and centred.

To overcome technical noise in the expression of a single gene, clusters (cell subsets) are classified based on their PCA scores that have been derived from their common expression of a set of highly variable genes. Thus, each principal component (PC) represents a combination of information across a correlated gene set. This identified 50 PCs.

data.pca <- ScaleData(data.norm, features = all.genes, do.center = T, do.scale = T)
## Centering and scaling data matrix
data.pca <- RunPCA(data.pca, features = VariableFeatures(object = data.norm))
## PC_ 1 
## Positive:  Cst3, P2ry12, Cx3cr1, Cd81, Tmem119, Sparc, Gpr34, Hexb, Olfml3, Selplg 
##     C1qa, Sepp1, Siglech, Fcrls, Lgmn, Ckb, Ltc4s, Rhob, Tgfbr1, Ly86 
##     Rnase4, Itgb5, Hpgd, Hpgds, Slco2b1, Plxdc2, Timp2, P2ry13, Serpine2, Arhgap5 
## Negative:  Lgals3, Prdx5, Ifitm3, AA467197, Tgfbi, Il1b, Plac8, AW112010, Tmsb10, Cstb 
##     Crip1, Ass1, Srgn, H2-Ab1, Clec7a, Ly6c2, Wfdc17, Ifitm2, Calm1, Il1rn 
##     Lgals1, F10, Ifi27l2a, Ms4a4c, Tnfaip2, Actg1, Clec4e, Ccl5, Apoe, H2-Eb1 
## PC_ 2 
## Positive:  Ctsb, H2-Eb1, H2-Ab1, Ms4a7, Gpnmb, Cd63, Spp1, Acp5, Arg1, Cst7 
##     Ccl5, Fabp5, Apoc2, Apoe, AW112010, Cstb, Cd63-ps, Cox6a2, Prdx1, Cd72 
##     Rgs1, Gm6166, C1qa, Lgmn, Syngr1, Clec4n, Slc7a2, Niacr1, Lgals3, Hmox1 
## Negative:  Btg2, Ier2, Junb, Jun, Atf3, Egr1, Ppp1r15a, Kdm6b, Klf2, Ier5 
##     Marcksl1, Nfkbiz, Irf1, Ifi203, Cxcl10, Icam1, Actg1, Fos, S100a4, Gadd45b 
##     Fosb, Ifitm6, Mndal, Cd83, Cxcl11, Gem, Tnpo3, Mxd1, Pyhin1, Dnajb1 
## PC_ 3 
## Positive:  S100a4, Ifitm6, Mndal, Ifi203, Pyhin1, Ifit3, Ifit1, Ms4a4c, Ifit2, Isg15 
##     Lsp1, Gm4955, Plac8, Oasl1, Gm9733, Lrg1, Mxd1, Tspan13, Usp18, Ifitm1 
##     Ifitm2, Ccr2, Cmpk2, Slfn4, Wfdc17, I830012O16Rik, Cxcl10, Cytip, Dmkn, Rsad2 
## Negative:  Nfkbiz, Fos, Egr1, Fosb, Junb, Ier2, Ppp1r15a, Klf2, Tnf, Jun 
##     Il1a, Gpr84, Sgk1, Btg2, Ier5, Gem, Atf3, Mt1, Arg1, Ccl3 
##     Ccrl2, Slc7a11, Hmox1, Dnajb1, Klf4, Cited2, Cxcl2, Icam1, Ctsb, Gadd45b 
## PC_ 4 
## Positive:  F10, Arg1, Ccl6, Maf, S100a8, Slc7a2, Clec4n, Slc7a11, Chi3l3, Hp 
##     Slpi, Inhba, 1190002H23Rik, Gpr34, Hpgd, Clec4e, Sgk1, Hopx, Clec4a2, Cd24a 
##     Ninj1, Nos2, Ltc4s, Sox4, Ecscr, Adrb2, F13a1, Cxcl2, Tgfbi, Gm11206 
## Negative:  Cst7, Cd72, Ccl12, Cox6a2, Ccl2, Iigp1, Cd63, Cd63-ps, Ly86, H2-Oa 
##     Stmn1, Ccl4, Cd180, Ifi27l2a, Ifit3, Cd83, H2-Eb1, Niacr1, G530011O06Rik, Ctsb 
##     Phgdh, Cxcl13, H2-Ab1, D17H6S56E-5, Ifit2, Ccl7, Ccl3, Cxcl10, C1qa, Lag3 
## PC_ 5 
## Positive:  Irg1, Inhba, Nos2, Ifi205, Il1rn, Il1a, Clec4e, Niacr1, Slc7a11, Serpina3g 
##     Cd40, Rsad2, Upp1, Fam26f, Slc7a2, Il12a, Bcl2a1a, Cnn3, Isg15, Cxcl9 
##     Cxcl11, Ninj1, Procr, Ccrl2, Bhlhe40, Ifit1, Oasl1, Clec7a, Clec4n, AA467197 
## Negative:  Ccr2, Fos, Ifitm2, Btg2, Apoc2, Tsc22d3, Lsp1, Thbs1, Fosb, Apoe 
##     Ifitm6, Ifitm1, Ifngr1, H2-DMb2, Il1r2, Klrd1, Cytip, Klf2, Crip1, Lgals1 
##     Junb, Ms4a7, Ier2, Tmem123, Egr1, Klf4, Anxa1, F13a1, Ppp1r15a, H1f0
print(data.pca[["pca"]], nfeatures = 5)
## PC_ 1 
## Positive:  Cst3, P2ry12, Cx3cr1, Cd81, Tmem119 
## Negative:  Lgals3, Prdx5, Ifitm3, AA467197, Tgfbi 
## PC_ 2 
## Positive:  Ctsb, H2-Eb1, H2-Ab1, Ms4a7, Gpnmb 
## Negative:  Btg2, Ier2, Junb, Jun, Atf3 
## PC_ 3 
## Positive:  S100a4, Ifitm6, Mndal, Ifi203, Pyhin1 
## Negative:  Nfkbiz, Fos, Egr1, Fosb, Junb 
## PC_ 4 
## Positive:  F10, Arg1, Ccl6, Maf, S100a8 
## Negative:  Cst7, Cd72, Ccl12, Cox6a2, Ccl2 
## PC_ 5 
## Positive:  Irg1, Inhba, Nos2, Ifi205, Il1rn 
## Negative:  Ccr2, Fos, Ifitm2, Btg2, Apoc2

The number of PCs will be used to inform downstream clustering. Thus, it is crucial to determine the true dimensionality for the dataset. It is evident from Figure 3 that the majority of variation is explained by PC1 to ~PC10. The optimal number of PCs was determined as the last point where change in percent variation is more than 0.1%. This identified the optimal dimensionality of the dataset as 11 PCs.

ElbowPlot(data.pca)
\label{fig:figs} Figure 3. Standard deviation of each principal component (PC).

Figure 3. Standard deviation of each principal component (PC).

mat <- Seurat::GetAssayData(data.pca, assay = "RNA", slot = "scale.data")
pca <- data.pca[["pca"]]
total_variance <- sum(matrixStats::rowVars(mat))
eigValues = (pca@stdev)^2  ## EigenValues
varExplained = (eigValues / total_variance)* 100

plot(varExplained, ylab="Percent of variance", xlab="Principal Component")
\label{fig:figs} Figure 4. Percent of variance explained by each principal component.

Figure 4. Percent of variance explained by each principal component.

pct <- data.pca@reductions$pca@stdev/sum(data.pca@reductions$pca@stdev) * 100
cum <- cumsum(pct)

# Determine which PC exhibits cumulative percent greater than 90% and % variation associated with the PC as less than 5
co1 <- which(cum > 90 & varExplained < 5)[1]
# Determine the difference between variation of PC and subsequent PC
co2 <- sort(which((pct[1:length(pct)-1] - pct[2:length(pct)]) > 0.1),  decreasing = T)[1] + 1 # last point where change of % of variation is more than 0.1%.

# Minimum of the two calculation
pcs <- min(co1, co2) 
pcs
## [1] 11

Clustering by optimal number of PCs

Cells separate into clear clusters in the PCA plot (Figure 5), corresponding to distinct disease states (healthy versus control conditions). This is consistent with the presence of new cell populations that infiltrate the spinal cord in response to the EAE disease state.

To determine the cellular substructure of the dataset, K-means clustering is performed on the dataset (dimensionality = PCs 1:11, resolution = 0.5). This identified 14 related cell clusters, as shown by Figure 5.

data.pca <- FindNeighbors(data.pca, reduction = "pca", dims = 1:pcs)
data.pca <- FindClusters(data.pca, pc.use = 1:pcs)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 6859
## Number of edges: 230062
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8341
## Number of communities: 14
## Elapsed time: 0 seconds
DimPlot(data.pca, reduction = "pca")
\label{fig:figs}Figure 5. Linear dimension reduction and k-means clustering identifies 14 distinct cell clusters. (A) Original sample membership of each cell. (B) Assigned cluster identity by k-means clustering.

Figure 5. Linear dimension reduction and k-means clustering identifies 14 distinct cell clusters. (A) Original sample membership of each cell. (B) Assigned cluster identity by k-means clustering.

data.pca <- StashIdent(data.pca, 
                        save.name = "cluster.id") 

plot1 <- DimPlot(data.pca, group.by = "orig.ident") + labs(title = "A. Original sample membership") +
   theme(plot.title = element_text(hjust = 0, vjust=2.12, size=12)) 
plot2 <- DimPlot(data.pca, group.by = "cluster.id",  label = T) + labs(title = "B.  Assigned cluster ID") +
   theme(plot.title = element_text(hjust = 0, vjust=2.12, size=12)) 

plot1 + plot2
\label{fig:figs}Figure 5. Linear dimension reduction and k-means clustering identifies 14 distinct cell clusters. (A) Original sample membership of each cell. (B) Assigned cluster identity by k-means clustering.

Figure 5. Linear dimension reduction and k-means clustering identifies 14 distinct cell clusters. (A) Original sample membership of each cell. (B) Assigned cluster identity by k-means clustering.

Cluster annotation

To assign cell type identity to each cluster, expression of the gene P2ry12, a known microglia-specific gene, was used to identify microglia. Expression of Vim, a known monocyte-specific gene, was used to identify monocyte clusters. This identified eight monocyte subclusters and five microglia subclusters. Microglia were best explained by PC1, whereas monocytes were best explained by PC2 (Figure 6).

Idents(data.pca) <- "cluster.id"
data.pca <- RenameIdents(data.pca,
                            "0" = "Monocytes",
                            "1" = "Monocytes",
                            "2" = "Microglia",
                            "3" = "Monocytes", 
                            "4" = "Microglia",
                            "5" = "Monocytes",
                            "6" = "Microglia",
                            "7" = "Microglia",
                            "8" = "Monocytes", 
                          "9" = "Monocytes", 
                          "10" = "Monocytes", 
                          "11" = "Microglia", 
                          "12" = "Monocytes", 
                          "13" = "Monocytes")
data.pca <- StashIdent(data.pca, save.name = "Cell_type")

Idents(data.pca) <- "cluster.id"
data.pca <- RenameIdents(data.pca,
                            "0" = "Monocyte1",
                            "1" = "Monocyte2",
                            "2" = "Microglia1",
                            "3" = "Monocyte3", 
                            "4" = "Microglia2",
                            "5" = "Monocyte4",
                            "6" = "Microglia3",
                            "7" = "Microglia4",
                            "8" = "Monocyte5", 
                          "9" = "Monocyte6", 
                          "10" = "Monocytes", 
                          "11" = "Microglia5", 
                          "12" = "Monocyte7", 
                          "13" = "Monocyte8")
data.pca <- StashIdent(data.pca, save.name = "Cell_subset")

plots <- list()
p1 <- FeaturePlot(data.pca, feature = "P2ry12") + labs(
       title = "A.  P2ry12") +
   theme(plot.title = element_text(hjust = 0, vjust=2.12, size=12)) 
p2 <- FeaturePlot(data.pca, feature = "Vim") + labs(
       title = "B.  Vim") +
   theme(plot.title = element_text(hjust = 0, vjust=2.12, size=12)) 

p3 <- DimPlot(data.pca, group.by = "Cell_type") + labs(
       title = "C.  Cell type") +
   theme(plot.title = element_text(hjust = 0, vjust=2.12, size=12)) 

p4 <- DimPlot(data.pca, group.by = "Cell_subset") + labs(
       title = "D.  Cell subset") +
   theme(plot.title = element_text(hjust = 0, vjust=2.12, size=12)) 

gridExtra::grid.arrange(p1, p2, p3, p4, layout_matrix = rbind(c(1,3),
                                                              c(2,4)))
\label{fig:figs} Figure 6. Expression of genes (A) P2ry12 and (B) Vim used for (C) cell type classification. (D) Clusters correspond to eight monocytes clusters and five microglia subsets.

Figure 6. Expression of genes (A) P2ry12 and (B) Vim used for (C) cell type classification. (D) Clusters correspond to eight monocytes clusters and five microglia subsets.

Random Forest

Phylogenetic decision tree

After identifying two cell types and 14 cell subsets, a phylogenetic decision tree is visualized to determine if identity (cell subset or cell type) can be predicted by a select number of genes (Figure 7). From the decision tree, clear separation is evident between the two cell types, which are further separated into distinct cell subsets (Microglia1-5, Monocytes1-8), suggesting these both cell subset and cell type may be predicted by a random forest classifier.

Idents(data.pca) <- "Cell_subset"
tree <- BuildClusterTree(data.pca, dims = 1:pcs)
PlotClusterTree(tree)
\label{fig:figs} Figure 7. Phylogenetic decision tree depicting classification of each cluster identity.

Figure 7. Phylogenetic decision tree depicting classification of each cluster identity.

summary(tree@tools$BuildClusterTree)
## 
## Phylogenetic tree: tree@tools$BuildClusterTree 
## 
##   Number of tips: 14 
##   Number of nodes: 13 
##   Branch lengths:
##     mean: 3.970079 
##     variance: 8.282716 
##     distribution summary:
##       Min.    1st Qu.     Median    3rd Qu.       Max. 
##  0.7538042  2.4316088  3.6763463  4.7236360 15.9174827 
##   No root edge.
##   First ten tip labels: Monocyte1 
##                         Monocyte2
##                         Microglia1
##                         Monocyte3
##                         Microglia2
##                         Monocyte4
##                         Microglia3
##                         Microglia4
##                         Monocyte5
##                         Monocyte6
##   No node labels.

Predicting cell subset identity

Defining important features

To determine genes that may be used to classify each cell subset, variable importance for each gene in classifying either (1) a particular cell subset or (2) a particular cell type was calculated on the training dataset. The top three genes with highest variable importance for each target are used as features (Table 1).

# Identify markers per cell subset
Idents(data.pca) <- "Cell_subset"
markers <- sctree::FindAllMarkers(
        data.pca,
        features = rownames(data.pca@assays$RNA@data),
        test.use = "RangerDE")
## Calculating cluster Monocyte1
## Calculating cluster Monocyte2
## Calculating cluster Microglia1
## Calculating cluster Monocyte3
## Calculating cluster Microglia2
## Calculating cluster Monocyte4
## Calculating cluster Microglia3
## Calculating cluster Microglia4
## Calculating cluster Monocyte5
## Calculating cluster Monocyte6
## Calculating cluster Monocytes
## Calculating cluster Microglia5
## Calculating cluster Monocyte7
## Calculating cluster Monocyte8
# Identify markers per cell type 
Idents(data.pca) <- "Cell_type"
cell.type.markers <- sctree::FindAllMarkers(
        data.pca,
        features = rownames(data.pca@assays$RNA@data),
        test.use = "RangerDE")
## Calculating cluster Monocytes
## Calculating cluster Microglia
set.seed(1)
Idents(data.pca) <- "cluster.id"
train1 <- RandomSubsetData(data.pca, 0.8)
test1 <- RandomSubsetData(data.pca, 0.2)

Idents(data.pca) <- "Cell_type"
train2 <- RandomSubsetData(data.pca, 0.8)
test2 <- RandomSubsetData(data.pca, 0.2)

cell.plot.markers <- do.call(rbind, lapply(split(cell.type.markers, cell.type.markers$cluster), head, 3))
plot.markers <- do.call(rbind, lapply(split(markers, markers$cluster), head, 3))

top_cell_markers <- cell.plot.markers$gene
top_markers <- plot.markers$gene
top_markers <- unique(top_markers) # remove duplicates

tree_fit.cell.type <- fit_ctree(train2, 
                      genes_use = top_cell_markers, 
                      cluster = "ALL")

tree_fit <- fit_ctree(train1, 
                      genes_use = top_markers, 
                      cluster = "ALL")

cell.plot.markers <- cell.plot.markers %>% rownames_to_column()
cell.marker.table <- cell.plot.markers[, c(4, 2)] %>% mutate_if(is.numeric, ~round(., 1))
plot.markers <- plot.markers %>% rownames_to_column()
marker.table <- plot.markers[, c(9, 4, 2)] %>% mutate_if(is.numeric, ~round(., 1))
colnames(marker.table) <- c("Cell subset (cluster ID)", "Gene", "Importance")

ft <- marker.table %>% flextable()
ft <- autofit(ft)
ft

Cell subset (cluster ID)

Gene

Importance

Monocyte1

Apoe

65.4

Monocyte1

H2-Eb1

68.3

Monocyte1

H2-Aa

80.6

Monocyte2

AA467197

88.1

Monocyte2

Clec4n

38.9

Monocyte2

Inhba

109.6

Microglia1

Cst3

56.2

Microglia1

P2ry12

12.1

Microglia1

Gpr34

15.6

Monocyte3

Plac8

51.3

Monocyte3

Ccr2

47.0

Monocyte3

Chi3l3

5.9

Microglia2

Gpr34

31.7

Microglia2

P2ry12

55.5

Microglia2

Cst3

38.5

Monocyte4

Fabp5

101.2

Monocyte4

Gpnmb

70.4

Monocyte4

Arg1

12.5

Microglia3

Ccl12

6.9

Microglia3

Cd63

25.0

Microglia3

Cst7

50.8

Microglia4

Klf2

77.9

Microglia4

Jun

101.5

Microglia4

Egr1

93.7

Monocyte5

S100a8

2.2

Monocyte5

S100a9

1.5

Monocyte5

Cxcl10

28.9

Monocyte6

Rsad2

29.8

Monocyte6

Isg15

30.6

Monocyte6

Ifit1

36.9

Monocytes

Ccl17

11.2

Monocytes

Ifitm1

37.8

Monocytes

Ccnd2

11.0

Microglia5

Ccl4

6.9

Microglia5

Ccl3

4.9

Microglia5

Ccl12

2.1

Monocyte7

Ifng

0.2

Monocyte7

Ccl1

0.1

Monocyte7

Nkg7

3.2

Monocyte8

Ccr7

4.7

Monocyte8

Ccl22

1.3

Monocyte8

Serpinb6b

6.7

Model validation

Next, features shown in Table 1 were used to train a random forest model to predict cell subset identity (n identities = 14).The performance of this model was assessed by evaluating accuracy and precision. Overall, this model was 63.4% accurate (Kappa = 0.594) in predicting cell subset identity. Additional measurements of model performance are shown in Table 2. Several cell subsets were frequently misclassified. As shown by the confusion matrix (Figure 8A), cluster 7 (Microglia4) and cluster 2 (Microglia1) were the most commonly misclassified cell subset (0% correctly classified), followed by cluster 3 (Monocyte3, 52.42% correct) and cluster 13 (Monocyte8, 60.08% correct). This is likely due to poor separation of these clusters, as shown on the PCA plot (Figure 8C), suggestion co-expression in the variable genes across various cell subsets.

testset <- as.data.frame(test1, top_markers, fix_names = FALSE)
predicted <- predict(tree_fit, testset)
gating_genes <- names(partykit::varimp(tree_fit[[1]]))
confusion_matrix <- table(data.frame(predicted = predicted,
                                         cluster = testset$ident))
test1[["predicted"]] <- predicted
test1[["correctly_classified"]] <- predicted == Idents(test1)
confusion_tbl <- table(
    Predicted = predicted, 
    Actual = test1@meta.data$cluster.id)
Actual = test1@meta.data$cluster.id

plot3 <- DimPlot(test1, group.by = "Cell_subset", label = F) + labs(
       title = "C.  Real subset identity") +
   theme(plot.title = element_text(hjust = 0, vjust=2.12, size=12)) 
plot2 <- DimPlot(test1, group.by = "correctly_classified", label = F) + labs(
       title = "B.  Accuracy of predicted cell type identity") +
   theme(plot.title = element_text(hjust = 0, vjust=2.12, size=12)) 
plot1 <- sctree::autoplot(as.frequency.matrix(confusion_tbl), show_number = TRUE) + labs(
       title = "A. Random forest confusion matrix") + theme(plot.title = element_text(hjust = 0, vjust=2.12, size=12, face="bold")) 

grid.arrange(plot1, plot2, plot3, layout_matrix = rbind(c(1,1,1,1,2,2,2),
                                               c(1,1,1,1,3,3,3)))
\label{fig:figs} Figure 8. (A) Real identities of subsets are based on top gene expression. (B) A random forest model was used to predict cell subset identity the top three genes per cell subset. TRUE denonates correct classification, whereas FALSE denotes incorrect classification. (C) Random forest confusion matrix depicting the predicted cell subset versus the actual cell susbset identity.

Figure 8. (A) Real identities of subsets are based on top gene expression. (B) A random forest model was used to predict cell subset identity the top three genes per cell subset. TRUE denonates correct classification, whereas FALSE denotes incorrect classification. (C) Random forest confusion matrix depicting the predicted cell subset versus the actual cell susbset identity.

a <- test1@meta.data[, c(10, 13)] 
colnames(a) <- c('obs', 'pred')

stat.t <- as.data.frame(multiClassSummary(a, lev = levels(a$obs))) %>% rownames_to_column(var = "Statistic") %>% mutate_if(is.numeric, ~round(., 3))
colnames(stat.t) <- c('Statistic', 'Value')

stat.t <- as.data.frame(multiClassSummary(a, lev = levels(a$obs))) %>% rownames_to_column(var = "Statistic") %>% mutate_if(is.numeric, ~round(., 3))
colnames(stat.t) <- c('Statistic', 'Value')


ft <- stat.t %>% flextable()
ft <- autofit(ft)
ft

Statistic

Value

Accuracy

0.650

Kappa

0.607

Mean_F1

Mean_Sensitivity

0.644

Mean_Specificity

0.972

Mean_Pos_Pred_Value

Mean_Neg_Pred_Value

0.973

Mean_Precision

Mean_Recall

0.644

Mean_Detection_Rate

0.046

Mean_Balanced_Accuracy

0.808

Predicting cell type

Defining important features

As the performance of the model in predicting cell subset was variable, a model was then created to classify the two cell types rather than cell susbet. Features important in defining cell types were defined using Ranger (see methods). This identified six features that were important in classifying each cell type (Table 3).

cell.marker.table <- cell.plot.markers[, c(9, 4, 2)] %>% mutate_if(is.numeric, ~round(., 1)) 
colnames(cell.marker.table) <- c("Cell type", "Gene", "Importance") 



ft <- cell.marker.table %>% flextable()
ft <- autofit(ft)
ft

Cell type

Gene

Importance

Monocytes

Plac8

16.9

Monocytes

AA467197

34.3

Monocytes

Lyz2

80.5

Microglia

Cst3

246.8

Microglia

Cd81

307.8

Microglia

P2ry12

113.1

Model validation

The features shown in Table 3 were then used to train a random forest model and predict cell type identity. As above, the dataset was split into a training (80% of cells) and testing (20% of cells) sets. After training, the performance of the model in predicting cell type (either ‘Microglia’ or ‘Monocyte’) was assessed (Figure 9, Table 4). As shown in Figure 11, the model predicts cell type 99.7% accuracy (Kappa = 0.994).

\label{fig:figs} Figure 9. (A) Real identities of cell type are based on top gene expression. (B) Random forest were used to classify cells by the top six genes. TRUE denonates correct classification, whereas FALSE denotes incorrect classification. (C) Confusion matrix depicting the assigned classification of each cell type versus the actual cell type identity.

Figure 9. (A) Real identities of cell type are based on top gene expression. (B) Random forest were used to classify cells by the top six genes. TRUE denonates correct classification, whereas FALSE denotes incorrect classification. (C) Confusion matrix depicting the assigned classification of each cell type versus the actual cell type identity.

a <- test2@meta.data[, c(11, 13)] 
colnames(a) <- c('obs', 'pred')

stat.t <- as.data.frame(multiClassSummary(a, lev = levels(a$obs))) %>% rownames_to_column(var = "Statistic") %>% mutate_if(is.numeric, ~round(., 3))
colnames(stat.t) <- c('Statistic', 'Value')

ft <- stat.t %>% flextable()
ft <- autofit(ft)
ft

Statistic

Value

Accuracy

0.996

Kappa

0.991

F1

0.996

Sensitivity

0.995

Specificity

0.996

Pos_Pred_Value

0.998

Neg_Pred_Value

0.993

Precision

0.998

Recall

0.995

Detection_Rate

0.601

Balanced_Accuracy

0.996

Application to flow cytometry

The high accuracy of this model indicates combinations of these genes may be used to identify these cells by flow cytometry. Simulated flow cytometry plots using the top six genes for classifying the two cell types are shown in Figure 10, demonstrating clear separation between these two cell types.

Idents(data.pca) <- "Cell_type"
plot_flowstyle(data.pca, top_cell_markers, classif_col = "ident", warn = FALSE)
\label{fig:figs} Figure 10. Faceted scatter plots simulating a flow cytometry gating strategy. Each plot shows the separation of monocytes (red) and microglia (blue) by each combination of features. Included features are defined as the top six genes, as defined by variable importances.

Figure 10. Faceted scatter plots simulating a flow cytometry gating strategy. Each plot shows the separation of monocytes (red) and microglia (blue) by each combination of features. Included features are defined as the top six genes, as defined by variable importances.

In conclusion, six out of 16681 genes can accurately predict cell type identity. These genes may be translated to other single-cell applications, such as flow cytometry as demonstrated by the simulated gating strategy. Predicting cell subset identity also performed well, however, this was variable between cell types and required a larger number of genes, therefore increasing the complexity of the classification and making it less likely to be translatable to flow cytometry.